Spotify Recommendation System With Clustering

Author: Daniel Hassler

```{python}
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, DBSCAN
from sklearn.model_selection import train_test_split, ParameterGrid
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
```

Data Analysis

Clustering algorithms can be applied to many real-world applications, including but not limited to security, anomaly detection, document clustering, stock market analysis, image compression, and so much more. The application I decided to approach with clustering is a song recommendation system. I found a dataset on Kaggle containing almost 114,000 songs from the popular music streaming platform Spotify. Each entry in the dataset consists of many features including artists, track_name, track_genre, popularity, danceability, and many more.

Below are some visualizations showcasing certain features in a scatterplot. This gives me a rough idea what the dataset looks like with all of these features and genres.

```{python}
original_df = pd.read_csv("./dataset.csv")
print(original_df.columns)
features_x = ["loudness", "popularity", "duration_ms"]
features_y = ["popularity", "energy", "tempo"]

for i, (x,y) in enumerate(zip(features_x, features_y)):
    scatter = sns.scatterplot(x=x, y=y, hue='track_genre', data=original_df, palette="viridis", alpha=0.25)
    legend_labels = original_df['track_genre'].unique()# [:3]  # Show only the first 3 genres
    scatter.legend(title='Genre', labels=legend_labels, prop={'size': 1})
    plt.title(f"Scatter Plot of {x} vs {y} by genre")
    plt.show()

plt.show()
```
Index(['Unnamed: 0', 'track_id', 'artists', 'album_name', 'track_name',
       'popularity', 'duration_ms', 'explicit', 'danceability', 'energy',
       'key', 'loudness', 'mode', 'speechiness', 'acousticness',
       'instrumentalness', 'liveness', 'valence', 'tempo', 'time_signature',
       'track_genre'],
      dtype='object')

K-Means

The K-Means algorithm clusters data by minimizing a criteria known as intertia, the within-cluster sum-of-squares. The formula for inertia, specified in the K-means documentation for Sklearn, is noted below:

Noting some of the variables in the summation: * n is the number of datapoints * mu is the mean of the cluster, also the cluster_centroid of the cluster C * ||x_i - \mu||^2 represents the squared euclidean distance between point x_i and the centroid * min() takes the min of the calculation

\[ \sum_{i=0}^{n}\min_{\mu_j \in C}(||x_i - \mu_j||^2) \]

A great benefit to K-means is its scalability to large sample sets.

Hyperparameter Tuning

```{python}
inertia = []
# train_df is the numeric representation of original_df
train_df = original_df.drop(columns=["track_id"])

for col in train_df.columns:
    if not pd.api.types.is_numeric_dtype(train_df[col]):
        train_df[col] = pd.factorize(original_df[col])[0]

scaler = StandardScaler()
# df_scaled is the scaled version of train_df
df_scaled = scaler.fit_transform(train_df)

for k in range(1, 200, 10):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit_predict(df_scaled)
    inertia.append(kmeans.inertia_)
```
D:\Users\dwh71\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\cluster\_kmeans.py:1416: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Users\dwh71\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\cluster\_kmeans.py:1416: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Users\dwh71\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\cluster\_kmeans.py:1416: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Users\dwh71\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\cluster\_kmeans.py:1416: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Users\dwh71\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\cluster\_kmeans.py:1416: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Users\dwh71\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\cluster\_kmeans.py:1416: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Users\dwh71\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\cluster\_kmeans.py:1416: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Users\dwh71\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\cluster\_kmeans.py:1416: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Users\dwh71\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\cluster\_kmeans.py:1416: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Users\dwh71\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\cluster\_kmeans.py:1416: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Users\dwh71\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\cluster\_kmeans.py:1416: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Users\dwh71\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\cluster\_kmeans.py:1416: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Users\dwh71\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\cluster\_kmeans.py:1416: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Users\dwh71\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\cluster\_kmeans.py:1416: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Users\dwh71\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\cluster\_kmeans.py:1416: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Users\dwh71\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\cluster\_kmeans.py:1416: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Users\dwh71\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\cluster\_kmeans.py:1416: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Users\dwh71\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\cluster\_kmeans.py:1416: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Users\dwh71\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\cluster\_kmeans.py:1416: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Users\dwh71\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\cluster\_kmeans.py:1416: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)

Plotting the Elbow Chart

```{python}
plt.plot(range(1, 200, 10), inertia, marker='o')
plt.title('Elbow Method for Optimal K')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia (Within-Cluster Sum of Squares)')
plt.show()
```

K-Means for Spotify

```{python}
kmeans = KMeans(n_clusters=23, random_state=42)
original_df['clusters'] = kmeans.fit_predict(df_scaled)
```
D:\Users\dwh71\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\cluster\_kmeans.py:1416: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)

Evaluating K-Means for Spotify

```{python}
original_df['Distance_to_Centroid'] = kmeans.transform(df_scaled).min(axis=1)

def get_nearest_entry(idx, k=5):
    # print(original_df.iloc[idx])
    # print(train_df.iloc[idx])
    
    cluster = kmeans.predict(df_scaled[idx].reshape(1,20))[0]
    cluster_data = original_df[original_df["clusters"] == cluster]
    cluster_data["closest_entries_to_idx"] = (original_df["Distance_to_Centroid"] - cluster_data.loc[idx]["Distance_to_Centroid"]).abs()
    cluster_data = cluster_data.sort_values(by="closest_entries_to_idx")

    cluster_data.drop(columns=["closest_entries_to_idx"])
    print(f"Top {k} Closest Examples to {cluster_data.loc[idx]['artists']}'s \"{cluster_data.loc[idx]['track_name']}\"")
    print(cluster_data[:5][["artists", "album_name", "track_name", "track_genre"]])
    print("\n\n")

get_nearest_entry(44151) # rock song
get_nearest_entry(19015) # country song
get_nearest_entry(51136) # rap song
```
Top 5 Closest Examples to Daughtry's "Home"
                   artists                   album_name         track_name  \
44151             Daughtry                     Daughtry               Home   
22159             Currents            The Death We Seek  The Death We Seek   
6552          Dark Funeral  Where Shadows Forever Reign    Unchain My Soul   
3355   Shinsei Kamattechan               Boku no Sensou     Boku no Sensou   
44252          Linkin Park              A Thousand Suns          Robot Boy   

       track_genre  
44151       grunge  
22159  death-metal  
6552   black-metal  
3355   alternative  
44252       grunge  



Top 5 Closest Examples to Florida Georgia Line's "Stay"
                                    artists               album_name  \
19015                  Florida Georgia Line        Sad Country Songs   
7189                        Leftover Salmon             High Country   
727                         Takehara Pistol                    youth   
10465  Keith Mackenzie;DJ Fixx;Whiskey Pete                  Blazing   
31547               Jax Jones;Years & Years  Queda poco para la PAES   

             track_name track_genre  
19015              Stay     country  
7189       High Country   bluegrass  
727    全て身に覚えのある痛みだろう?    acoustic  
10465           Blazing   breakbeat  
31547              Play     electro  



Top 5 Closest Examples to Future;Lil Uzi Vert's "Tic Tac"
                   artists                album_name  \
51136  Future;Lil Uzi Vert        Tic Tac - Just Rap   
15533       A1 x J1;Nemzzz  Don’t Lie (feat. Nemzzz)   
39817        FFRAGEZEICHEN                   FR EP 2   
53209       Tiësto;KAROL G              Don't Be Shy   
24700            Moodymann                 Moodymann   

                     track_name     track_genre  
51136                   Tic Tac         hip-hop  
15533  Don’t Lie (feat. Nemzzz)           chill  
39817                Passed Out          german  
53209              Don't Be Shy           house  
24700                        No  detroit-techno  


C:\Users\dwh71\AppData\Local\Temp\ipykernel_4172\3089484603.py:9: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cluster_data["closest_entries_to_idx"] = (original_df["Distance_to_Centroid"] - cluster_data.loc[idx]["Distance_to_Centroid"]).abs()
C:\Users\dwh71\AppData\Local\Temp\ipykernel_4172\3089484603.py:9: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cluster_data["closest_entries_to_idx"] = (original_df["Distance_to_Centroid"] - cluster_data.loc[idx]["Distance_to_Centroid"]).abs()
C:\Users\dwh71\AppData\Local\Temp\ipykernel_4172\3089484603.py:9: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cluster_data["closest_entries_to_idx"] = (original_df["Distance_to_Centroid"] - cluster_data.loc[idx]["Distance_to_Centroid"]).abs()

Conclusion